Exploratory data analysis exercise. Honors project.

Pablo Hinojosa Lopez

The dataset chosen concerns the tragic events of April 15, 1912, when the Titanic collided with an iceberg that broke the main structure of the ship. Unfortunately, there were not enough lifeboats on board for everyone, which resulted in the death of 1502 passengers and crew.

The main purpose of this analysis is to find out whether certain passenger characteristics improved the chances of surviving the tragedy.

Did rich people receive preferential treatment?

Were women and children given priority?

Depending on the location of their cabin, were some passengers able to reach the lifeboats sooner?

I will try to frame and answer these kinds of questions.

Variable description

Let's analyse the different variables in our data frame and highlight the most interesting details about each of them.

As we might expect, the target variable is the 'Survived' column, which indicates whether the passenger survived or not.

The info output also shows that some variables have missing data; we will handle this later.
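A minimal sketch of how the missing data could be counted per column. The small frame below is a made-up stand-in (the values are illustrative, not real passengers), and the column names other than 'Survived' are assumed to follow the usual Titanic naming.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (hypothetical values, assumed column names).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Cabin": [None, "C85", None, None, "C123"],
    "Embarked": ["S", "C", "S", None, "S"],
})

# Number and share of missing values per column, largest first.
missing = df.isnull().sum().sort_values(ascending=False)
missing_share = (missing / len(df)).round(2)
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

Columns with a high share of missing values are candidates for dropping or for a dedicated imputation strategy; columns with only a few gaps can usually be filled directly.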

Explore numerical features

There are no quasi-constant features, so no additional columns need to be dropped.
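One possible way to check for quasi-constant features is to look at the share of rows taken by the most frequent value in each column. The helper name `quasi_constant_columns`, the 0.95 threshold, and the toy frame are all assumptions made for illustration.

```python
import pandas as pd

def quasi_constant_columns(frame: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for col in frame.columns:
        # value_counts sorts by frequency, so the first entry is the dominant value.
        top_share = frame[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Hypothetical data: 'Flag' is 99% zeros, 'Pclass' is well spread.
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 3] * 25,
    "Flag": [0] * 99 + [1],
})
print(quasi_constant_columns(toy))
```

An empty list means that no column is dominated by a single value at the chosen threshold.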

The previous plots show that, as we could expect, there are more passengers with low fares than with high fares.

It is also worth noting that the mean age is about 25 to 30 years.

Apparently, the variables that most affect the target variable are related to the fare and class of the ticket.

This suggests that money was an important factor in whether a passenger survived, and more generally that wealth is a parameter worth taking into account.

We also note that class and age are strongly correlated, which is plausible because older people have usually had more years to build up their financial means.

Because no feature is highly correlated with the target variable, we proceed by plotting its relationship with the slightly correlated variables.
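As a rough sketch, the correlation of each numerical feature with the target can be ranked by absolute value. The frame below is invented for illustration, and the column names 'Pclass', 'Fare', and 'Age' are assumed.

```python
import pandas as pd

# Hypothetical numeric columns; the values are illustrative only.
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0, 0, 1],
    "Pclass":   [1, 1, 3, 3, 2, 3, 2, 1],
    "Fare":     [80.0, 70.0, 8.0, 7.5, 20.0, 9.0, 13.0, 60.0],
    "Age":      [30.0, 45.0, 22.0, 28.0, 35.0, 19.0, 40.0, 50.0],
})

# Pearson correlation of every feature with the target.
corr = toy.corr()["Survived"].drop("Survived")

# Reorder so the strongest relationships (positive or negative) come first.
corr_with_target = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr_with_target)
```

Keep in mind that Pearson correlation only captures linear relationships, so a weak coefficient does not rule out an effect in a specific range of the feature.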

In addition to plotting and inspecting the correlations, it is interesting to draw scatter plots for the other numerical features that are correlated with each other.

As we can see, this pairplot does not offer much information about the relationships between variables. Let's plot some specific relationships that could give us a better idea of how the data are related.

It will be interesting to plot the relationship between fare and class (which is a natural one), as well as the relationship between survival and age, and between survival and the number of children and parents aboard.

It can be seen that there is not much of a relationship between age and survival, except in the youngest age range (babies), where age does seem to affect survival. Let's analyze that age range to check whether, as the story goes, babies and children were given priority.
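A possible way to look at specific age ranges is to bin the ages and compare the survival rate per bin. The bin edges, the labels, and the toy values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical ages and outcomes, not real passengers.
toy = pd.DataFrame({
    "Age":      [0.8, 2, 4, 9, 15, 22, 30, 41, 58, 70],
    "Survived": [1,   1, 1, 0, 0,  0,  1,  0,  0,  0],
})

# Right-closed bins: (0, 5], (5, 12], (12, 18], (18, 60], (60, 120].
bins = [0, 5, 12, 18, 60, 120]
labels = ["baby", "child", "teen", "adult", "senior"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Share of survivors within each age band.
rate_by_band = toy.groupby("AgeBand", observed=False)["Survived"].mean()
print(rate_by_band)
```

Bands with very few passengers give unstable rates, so it helps to inspect the group sizes alongside the means.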

Let's see how the number of family members affects the 'Survived' variable. To do this, I am going to do a little feature engineering to transform the data.
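A minimal sketch of this kind of transformation, assuming the usual 'SibSp' (siblings and spouses) and 'Parch' (parents and children) columns. The new names 'FamilySize' and 'IsAlone' are hypothetical, and the values are invented.

```python
import pandas as pd

# Hypothetical rows; column names assumed from the usual Titanic layout.
toy = pd.DataFrame({
    "SibSp":    [1, 0, 0, 3, 0, 1],
    "Parch":    [0, 0, 2, 1, 0, 1],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Family size counts the passenger plus all relatives aboard.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

# Survival rate for each family size.
survival_by_size = toy.groupby("FamilySize")["Survived"].mean()
print(survival_by_size)
```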

As we can see in the graph, having more or fewer family members aboard does not appear to be very important.

As we can also see, there appears to be a slight correlation between age and survival in the previous graphs. Now it is time to analyse the missing values in the numerical features.

As we can see, SimpleImputer is not useful in this case because it replaces every missing value with the same constant, which creates an unrealistic distortion of the distribution.
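The distortion can be illustrated with a small experiment: filling every gap with one value piles observations at a single point and shrinks the spread. The synthetic ages below are generated for illustration only, and the median strategy is an assumption about how the imputer was configured.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Synthetic ages (illustrative, not the real column) with 100 of 500 values removed.
rng = np.random.default_rng(0)
age = pd.Series(rng.normal(30, 14, size=500).clip(0.5, 80))
age_missing = age.copy()
age_missing.iloc[rng.choice(500, size=100, replace=False)] = np.nan

# Every missing entry is replaced by the same number (the median of the observed values).
imputer = SimpleImputer(strategy="median")
age_imputed = imputer.fit_transform(age_missing.to_frame()).ravel()

# Standard deviation before (NaNs ignored) and after imputation.
std_before = age_missing.std()
std_after = pd.Series(age_imputed).std()
print(f"std before: {std_before:.2f}, std after: {std_after:.2f}")
```

Alternatives that preserve the shape better include sampling from the observed values or imputing per group (for example, per class or title), at the cost of some extra complexity.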

Categorical Features

Now I will perform the same analysis for the categorical features in our dataset.

The previous plots show that no categorical feature is dominated by a single value, so there is none that we could delete for not containing any useful information.

As we can see, because the target variable is binary, this type of plot is not useful here, so let's try another kind of representation.

We can see that sex appears to have an influence: more females survived the disaster, so a model should take this into account.

Here we can also see that class and sex together have a considerable influence on the target variable.
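The combined effect of the two variables can be summarised numerically with a pivot table of survival rates. This is only a sketch on invented rows, with 'Sex' and 'Pclass' assumed as column names.

```python
import pandas as pd

# Hypothetical rows, two passengers per (sex, class) cell.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Pclass":   [1, 1, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows are sexes, columns are classes, cells are the share of survivors.
rate = toy.pivot_table(index="Sex", columns="Pclass", values="Survived", aggfunc="mean")
print(rate)
```

Because 'Survived' is coded as 0/1, the mean of each cell is directly the survival rate for that combination.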

Feature Engineering

Let's do some feature engineering in order to extract additional information from the available dataset.

Here we can see the influence of class on the survival variable. In the Embarked variable, we can also see that passengers who boarded at Q and C had a better chance of surviving than those who boarded at S.

Data visualization

This graph confirms that the number of relatives is apparently not important.

As I said previously, class has a strong influence on the outcome for each person.

This analysis shows that women and people with high-society titles were favoured in the survival outcome.
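Titles like these are usually extracted from the name column. The sketch below assumes names in the form "Surname, Title. Given names"; the names are invented, and the grouping of uncommon titles under "Rare" is a hypothetical choice.

```python
import pandas as pd

# Invented names in the assumed "Surname, Title. Given names" format.
names = pd.Series([
    "Doe, Mr. John",
    "Roe, Mrs. Jane (Mary Smith)",
    "Poe, Miss. Anna",
    "Moe, Dr. Carl",
    "Loe, the Countess. of (Ella Gray)",
])

# Capture everything between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Keep the common titles and group the rest under one label.
common = {"Mr", "Mrs", "Miss", "Master"}
title_group = titles.where(titles.isin(common), "Rare")
print(pd.DataFrame({"title": titles, "group": title_group}))
```

Grouping rare titles avoids categories with only one or two passengers, which would otherwise give unreliable survival rates.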

Let's look at which type of ticket is associated with better chances of survival.

Hypothesis Testing

Null Hypothesis (H0): the male mean survival rate is greater than or equal to the female mean survival rate.

Alternative Hypothesis (H1): the male mean survival rate is less than the female mean survival rate.

Determine the test statistic: This will be a one-tailed (left-tailed) test, since the alternative hypothesis states that the male survival rate is lower than the female one. Since we do not know the population standard deviation (σ) and n is small, we will use the t-distribution.

Specify the significance level: Specifying a significance level is an important step of the hypothesis test. It is a balance between Type I and Type II errors, which are not discussed in depth here. For now, we set the significance level (α) to 0.05, so the confidence level is (1 - α) = (1 - 0.05) = 95%.

Computing the t-statistic and p-value: Let's take a random sample and look at the difference.

According to the samples, the measured difference between the male sample mean (x̄_m) and the female sample mean (x̄_f) is about 0.6 (statistically, this is called the point estimate of the difference between the male and female population means). Keeping in mind that...

As we can see, the null hypothesis is rejected, so our test supports the alternative hypothesis, which tells us that the mean survival rate of women is greater than that of men.
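The test can be sketched with SciPy as follows. The two arrays are fixed, invented samples (used here instead of a random draw so that the result is reproducible); the survival shares in them are illustrative and not measured from the dataset. Welch's variant (`equal_var=False`) is an assumption, chosen because it does not require equal variances.

```python
import numpy as np
from scipy import stats

# Hypothetical 0/1 survival outcomes for 50 men and 50 women (illustrative shares).
male = np.array([1] * 10 + [0] * 40)
female = np.array([1] * 37 + [0] * 13)

# Point estimate of the difference between the two sample means.
mean_diff = male.mean() - female.mean()

# One-sided Welch t-test; H1: mean(male) < mean(female).
t_stat, p_value = stats.ttest_ind(male, female, equal_var=False, alternative="less")

alpha = 0.05
reject_null = p_value < alpha
print(f"difference: {mean_diff:.2f}, t: {t_stat:.2f}, p: {p_value:.4g}, reject H0: {reject_null}")
```

With `alternative="less"` the p-value is the probability of seeing a t-statistic this low or lower if H0 were true, so a p-value below α leads to rejecting H0 in favour of H1.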

Conclusion

To sum up, we have identified the most important variables and their level of influence on the target variable.